Although artificial intelligence and data science have become key tools for
improving many processes, identifying patterns and supporting decision making,
there are environments in which they are not yet being used relevantly and
effectively. The objective of this study is to identify the relevant factors
of satisfaction with teaching performance during the virtual learning
environment, based on the opinions expressed by students through the social
network Twitter. For this purpose, sentiment analysis and text mining are
applied using the Python programming language in the JupyterLab environment.
The results show a predominance of positive polarity (57.27%), and the
relevant factors of student satisfaction with teaching performance are related
to the way the teacher conducts the class sessions, which contributes to
learning the process control subject through simulation tools such as Simulink
and tools linked to proportional-integral-derivative (PID) controllers. On the
other hand, a negative polarity of 15.45% corresponds to factors linked to the
laboratory sessions in which graphic representation and block diagrams were
used to explain the class session.
1. INTRODUCTION

The quality of service in the educational field has become an indicator of good organizational
practices at a global level [1], [2]. In this field, service quality has been related to a set of attributes of
institutional management against which students evaluate the fulfillment of their expectations associated
with their academic experience [3], [4]. The evaluation of academic quality improves when opinions and
indicators of student satisfaction are incorporated, since the university student is the main actor in the
study of the quality of institutions [5], [6]. Likewise, [7], [8] point out that university satisfaction is one of
the fundamental objectives for determining the quality of virtual training in educational organizations, a
key factor being the capacity and performance of teachers in responding to the expectations and needs of
the students. Regarding teaching performance in the virtual environment, [9], [10] point out that
teacher-student interactions take place through digital tools based on information and communication
technologies (ICT). Traditionally, satisfaction is measured through surveys. However, in the context of
virtualization, new methods and procedures are sought that take advantage of the increased use of the
Internet and the data it generates [11].

Today, millions of data items are produced on the Internet due to the massive use of social networks.
This large amount of data is attractive to commercial, industrial and academic organizations [12], [13]. In
addition, users actively participate on the Internet by leaving their own comments, opinions and reviews on
all kinds of topics. For this reason, researchers have worked for several decades on systems that allow large
amounts of text to be analyzed automatically using natural language processing (NLP) and data mining
techniques, among others [14], [15]. Indeed, one of the most important transformations today is the
exponential growth of data, which opens the possibility of new ways of analyzing reality [16]. Thus, the
value of information no longer lies in specific data, but in the way in which massive data is correlated to
discover hidden patterns, trends and associations [17]. Until now, big data analysis has been used in different
disciplines of scientific research to solve complex problems in the fields of environment, education, health,
transportation, national security and biomedicine [18]. Data science is made up of three areas: big data,
which is used to process data; data mining, whose purpose is to find patterns; and data visualization, whose
purpose is to facilitate a clear understanding of information and promote its dissemination [19]. Data mining
is considered to be the analysis of observed data sets, generally of large volume, with the aim of finding new
relationships between variables and summarizing the data in an understandable and useful way [20].

As indicated, data mining encompasses a series of modeling techniques aimed at different
requirements of organizations. In the educational sector, the use of these techniques can be seen in a series
of works aimed at improving the quality of university education [21], [22]. Within data mining lies text
mining, which is the process of finding previously unknown implicit patterns that may be useful in a text
repository [23]. Text mining relies on NLP techniques; NLP is a field of computer science, artificial
intelligence and linguistics that studies the interactions between computers and human language through
syntactic, semantic, pragmatic and morphological analysis. These rules, in combination with the information
stored in computational dictionaries, define the patterns to be recognized in images, text and voice [24],
[25]. Thanks to the expansion of technology, recent years have seen exponential growth in the subjective
information available on the Internet. This phenomenon has given rise to interest in sentiment analysis (SA),
an NLP task that identifies opinions related to an object [26]. SA focuses on detecting, extracting, analyzing
and classifying opinions, sentiments, evaluations and emotions published on different social networks
(Facebook, Twitter, LinkedIn, among others) about certain organizations or people [27]. As indicated in
[28], sentiment analysis seeks to extract an opinion and determine its polarity (positive, negative or neutral).
Likewise, [29], [30] point out that Twitter is one of the most commonly used tools for determining the
opinions of users about a wide variety of topics, and that in the educational environment sentiment analysis
is used to evaluate student performance, teacher performance and academic dropout, among other relevant
issues, with the aim of improving educational quality.

Given the above, this research aims to identify the most relevant factors of university student
satisfaction with teaching performance during the virtual learning environment, through the text mining and
sentiment analysis approach. The paper is organized as follows: section 2 presents the literature review or
state of the art on sentiment analysis and the assessment of polarity in the context of the use of social
networks, section 3 describes the methodology and the proposed development, section 4 presents and
discusses the results, and finally section 5 presents the conclusions and future work.


2. LITERATURE REVIEW

Understanding and processing emotions through sentiment analysis is a key factor in emulating
human intelligence and in identifying the polarity related to a feeling. This way of approaching the analysis
of people's behavior has aroused interest in the academic and scientific community [31]. With the great
influx of information handled through social networks, driven by the active participation of people, there
has been an almost exponential increase in the subjective data available on the Internet, which has generated
great interest in evaluating the statements contained in these messages. It is therefore necessary to use tools
that detect, extract and structure this subjective information [32]. A study on sentiment analysis applied to
the processing of emotions and opinions of university students concludes that sentiment analysis is a highly
complex technique because of the data processing it involves: NLP, cleaning and filtering, construction of
classifiers and evaluation of results. However, the same study adds that sentiment analysis can be carried out
on different instruments for assessing university student satisfaction, allowing an efficient allocation of
resources to programs or faculties within educational institutions [33]. Sentiment analysis through natural
language processing is relevant because it makes it possible to express evaluations of satisfaction with a
product or service, using data obtained from social networks [20]. However, performing sentiment analysis
on any social network implies the use of data pre-processing techniques, which can be either Twitter-specific
or traditional. An analysis of the performance of these pre-processing techniques found that, in terms of
precision, both produce viable results in the treatment of data, while in specific stages such as tokenization,
labeling and stop word removal the Twitter-specific techniques show significant advantages [34].


3. METHOD 
3.1. Research level and study unit 

The research is non-experimental, because no action is carried out on the students that would alter
or modify the comments they make through the social network Twitter regarding satisfaction with teaching
performance. In this sense, the data is treated in its natural condition. In addition, the data is collected
weekly in each class session. The unit of analysis is the set of comments or opinions expressed on Twitter
by the students enrolled in the automatic process control course of the professional school of mechanical
and electrical engineering of the National Technological University of South Lima (UNTELS).


3.2. Description of the collected data 

The sentiment analysis and text mining are developed in the Python programming language, in the
JupyterLab environment, which shows the metrics of the Twitter account for the comments or opinions
(tweets) made in each class session during the virtual learning environment. The data was acquired over a
period of 5 weeks, from week 9 to week 13, with the purpose of obtaining a first result and then applying
this model to the analysis of the 16 weeks of class that make up an academic cycle. The total impressions
obtained in the 5 weeks of data acquisition were 455, and the number of interactions or comments
represented by tweets was 220. Figure 1 shows the code developed in Python.


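import pandas as pd

# Load the tweet activity metrics exported from Twitter and preview the first rows
df2 = pd.read_csv("tweet_activity_metrics_process_control.csv")
df2.head()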


Figure 1. Code developed in Python to obtain tweet metrics 


Executing the code shown returns the "tweet id", which is the identification code of the tweets
generated in each class session. Another important aspect of the data collection is the evaluation of
"impressions" and "interactions", whose rates from week 9 to week 13 were 0.529 and 0.433, respectively.
This shows, on average, a relationship of about 50% between the visits the account received and the
participation in the comments. Participation of the students in the comments on the Twitter account was
100%, considering that 25 students were enrolled in the automatic process control subject. Figure 2 shows
the metrics of the tweets generated during the data collection period.
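
The aggregate figures reported for the collection period can be related with a short calculation. The
sketch below uses only the totals stated in the text (455 impressions, 220 tweets, 25 students); the variable
names are illustrative, and the per-student average says nothing about how the tweets were actually
distributed among the students.

```python
# Totals reported for weeks 9 to 13 (taken from the text)
impressions = 455
tweets = 220
students = 25

# Share of impressions that resulted in a comment
interaction_ratio = tweets / impressions

# Average number of tweets per enrolled student over the 5 weeks
tweets_per_student = tweets / students

print(f"interactions / impressions = {interaction_ratio:.3f}")
print(f"tweets per student = {tweets_per_student:.1f}")
```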


Figure 2. Metrics of generated tweets


3.3. Conditioning, sentiment polarization analysis and data extraction

Figure 3 shows the model for identifying the relevant factors of university student satisfaction with
teaching performance during the virtual learning environment. The first stage is the acquisition of the
comments or opinions expressed on Twitter, which consists of storing the data in comma-separated values
(CSV) format and conditioning the texts with Python libraries. The sentiment analysis and the term
frequency-inverse document frequency (TF-IDF) approach are then applied, producing as outputs the
polarization of sentiments (positive, negative and neutral) and the extraction of the data that allows the
relevant factors to be determined.


[Figure 3 diagram. Recoverable labels: Twitter social network; data stored in CSV format; understanding
and processing (Python); converting text to numeric data and extracting information through the TF-IDF
approach; Output 2: relevant factors of satisfaction]


Figure 3. Model for the identification of relevant factors of student satisfaction 


The model in Figure 3 specifies that the data provided by Twitter in a file with a "csv" extension is
conditioned in order to obtain, through sentiment analysis, the polarization of the comments made on
Twitter. The conditioning or pre-processing consists of eliminating repeated texts and excess blank spaces
between words, as well as converting all texts to lowercase. Next, the sentiment analysis of the tweets written
in English is carried out using Python libraries such as "nltk" and "vader". The result is the quantification of
the sentiment contained in each comment or opinion written by the students, which is classified as positive
polarity (greater than 0), neutral polarity (equal to 0) or negative polarity (less than 0). This is the first output
(output 1) of the model, and it represents relevant information for decision making aimed at improving the
virtual teaching-learning process of the process control subject.

In the proposed model (Figure 3), the last stage is the extraction of texts through the term
frequency-inverse document frequency (TF-IDF) approach. The TF-IDF indicator makes it possible to select
the terms that appear most often in each comment or opinion and, at the same time, least often across all
comments. Through this stage, the model identifies the most relevant factors of university student
satisfaction with teaching performance during the virtual learning environment. This second output
(output 2) of the proposed model also provides the TF-IDF value, which represents the weight of a term "i"
in a tweet "j". As indicated in [28], TF-IDF provides the best F-score performance, approximately
77.8%~94.8%, compared to other approaches, which supports its use in determining the most relevant
aspects of satisfaction.
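
The weight of a term "i" in a tweet "j" can be written explicitly. The form below is the smoothed variant
that scikit-learn's TfidfVectorizer applies by default; the paper does not state which variant was used, so
this choice is an assumption. Here tf_{i,j} is the number of times term i occurs in tweet j, N is the number
of tweets, and df_i is the number of tweets that contain term i.

```latex
w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i,
\qquad
\mathrm{idf}_i = \ln\!\left(\frac{1 + N}{1 + \mathrm{df}_i}\right) + 1
```

With the default settings of that class, each tweet vector is additionally scaled to unit Euclidean length,
so the weights of different tweets are comparable regardless of tweet length.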




3.4. Conditioning, sentiment polarization analysis and data extraction 

To carry out this stage, programming code is written to load the file provided by Twitter, called
"tweets_process_control_29Dic21.csv", into the JupyterLab environment. The "pandas" library is imported
to read and write data in different formats or extensions; in this investigation the file has a "csv" extension.
Likewise, the "numpy" library is used to create vectors or matrices in which the texts are linked with their
corresponding sentiment and polarity. Once this is done, the conditioning or pre-processing of the texts is
carried out. Figure 4 shows the code used to condition the texts contained in the comments on the Twitter
account, using Python libraries ("nltk.sentiment.vader", "SentimentIntensityAnalyzer", "re"). This stage is
responsible for eliminating repeated texts, isolated single characters and excess blank spaces between words,
as well as converting all texts to lowercase.


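from nltk.sentiment.vader import SentimentIntensityAnalyzer
import nltk
import re

# df is the DataFrame loaded from the Twitter CSV file; the first column holds the tweet text
features = df.iloc[:, 0].values

processed_features = []

for sentence in range(0, len(features)):
    # Replace special (non-word) characters with a space
    processed_feature = re.sub(r'\W', ' ', str(features[sentence]))
    # Remove isolated single characters inside the text
    processed_feature = re.sub(r'\s+[a-zA-Z]\s+', ' ', processed_feature)
    # Remove a single character at the start of the text
    processed_feature = re.sub(r'^[a-zA-Z]\s+', ' ', processed_feature)
    # Collapse repeated blank spaces into one
    processed_feature = re.sub(r'\s+', ' ', processed_feature, flags=re.I)
    # Remove a leading 'b' left over from byte strings
    processed_feature = re.sub(r'^b\s+', '', processed_feature)
    # Convert to lowercase
    processed_feature = processed_feature.lower()

    processed_features.append(processed_feature)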


Figure 4. Script for conditioning or pre-processing texts 


Once the texts had been conditioned and normalized, the sentiment contained in each of the tweets
written by the students of the process control subject was determined. Table 1 shows the sentiment value of
the first 5 tweets, from which the polarity of the sentiments was assigned. The satisfaction results are then
obtained by comparing the number of comments with positive and negative polarity.


Table 1. Result of the sentiment contained in each tweet 


Comment number   Comments on the satisfaction of the teaching performance    Sentiment
1                It has been a good way to deepen the topic ...               0.4404
2                I would have liked the problems proposed in ...              0.2960
3                I found it interesting that the maximum overshot ...         0.4019
4                The statements are the same as those learned...              0.0000
5                There are topics that were not clear in other...            -0.2924
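
The assignment of polarity from the sentiment value can be sketched as follows. The scores are the five
values listed in Table 1; the function name and the zero threshold are assumptions based on the signs
observed in the table (in practice, a small dead band around zero is sometimes used instead of an exact
zero).

```python
def polarity(score):
    """Map a sentiment score to a polarity label using a zero threshold."""
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

# Sentiment values of the first five tweets (Table 1)
table1_scores = [0.4404, 0.2960, 0.4019, 0.0000, -0.2924]
labels = [polarity(s) for s in table1_scores]
print(labels)
```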


Finally, the programming code is executed to extract the most relevant terms contained in the tweets
with positive and negative polarity, since these are closely related to satisfaction and dissatisfaction with
teacher performance during the virtual learning environment. The code makes use of corpora such as
"stopwords" and "brown", imported from nltk.corpus, which support text processing. Since the extraction
follows the TF-IDF approach, the "TfidfVectorizer" class is used. It allows setting the "max_features"
parameter, which is the number of most frequent words to be extracted from all the tweets with positive and
negative polarity generated by the students. The "min_df" parameter establishes the minimum number of
tweets in which an extracted word must be found, and the "max_df" parameter represents the maximum
percentage of tweets in which an extracted word may be found. In [24] it is pointed out that 93 general stop
words were identified, which helped in determining the relevant factors of satisfaction. Furthermore, by
incorporating the stop word removal pre-processing step into the Tamil text pooling, an improvement of 2.4
in F-score was observed for TF-IDF. Similarly, [14] indicates that there is a significant improvement in
F-score when the data set is used after removing stop words, and that the impact of removing stop words is
greater for TF-IDF than for fastText. Likewise, [28] points out that TF-IDF provides the best performance in
determining the most relevant aspects of satisfaction. This was reflected in its results, where the system
achieved an F-score of 0.939 using TF-IDF and around 0.899 using TF, which indicates that the system is
fast and reliable.
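
The effect of the three parameters on the extracted vocabulary can be illustrated without any external
library. The sketch below is a simplified pure-Python illustration, not the implementation used by
TfidfVectorizer: the function name, the sample tokens and the tie-breaking rule are assumptions, while the
meaning of "min_df" (absolute number of documents), "max_df" (proportion of documents) and
"max_features" (most frequent terms kept) follows the description given here.

```python
from collections import Counter

def select_vocabulary(docs, max_features, min_df, max_df):
    """Illustrate how min_df, max_df and max_features restrict the vocabulary.

    docs is a list of token lists; min_df is an absolute number of documents
    and max_df is a proportion of documents.
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    term_freq = Counter(term for doc in docs for term in doc)

    # Keep terms that appear in enough documents but not in too many
    kept = [t for t, df in doc_freq.items()
            if df >= min_df and df <= max_df * n_docs]

    # Among those, keep the max_features most frequent terms overall
    kept.sort(key=lambda t: (-term_freq[t], t))
    top = kept[:max_features]

    # Assign column indices in alphabetical order
    return {term: index for index, term in enumerate(sorted(top))}

# Hypothetical tokenized tweets, for illustration only
docs = [
    ["class", "simulink", "useful"],
    ["class", "pid", "simulink"],
    ["class", "pid", "clear"],
    ["class", "simulink", "pid"],
]
vocabulary = select_vocabulary(docs, max_features=2, min_df=2, max_df=0.75)
print(vocabulary)
```

In this toy example "class" is discarded because it appears in every tweet, and "useful" and "clear" are
discarded because each appears in only one tweet. Because column indices are assigned alphabetically, the
order in which a vocabulary dictionary is printed does not have to match the order of its index values.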


4. RESULTS AND DISCUSSION 

Figure 5 shows the polarity of the sentiments obtained from the comments or opinions of university
students posted on Twitter; this analysis was carried out in Python. As shown in Figure 5, positive
sentiments predominate in the comments or opinions expressed on Twitter regarding satisfaction with
teaching performance: the tweets are distributed as 57.27% positive polarity, 27.27% neutral polarity and
15.45% negative polarity.
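
The reported percentages are consistent with whole-number tweet counts. The counts used below (126, 60
and 34 out of 220) are not stated in the paper; they are inferred here from the percentages and the total of
220 tweets, and should be read as an assumption.

```python
# Tweet counts per polarity: inferred from the reported percentages (assumption)
counts = {"positive": 126, "neutral": 60, "negative": 34}
total = sum(counts.values())  # 220 tweets, as reported in section 3.2

# Percentage of tweets in each polarity class, formatted to two decimals
shares = {label: f"{100 * count / total:.2f}" for label, count in counts.items()}

for label, count in counts.items():
    print(f"{label}: {count} tweets ({shares[label]}%)")
```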


[Figure 5 bar chart: number of tweets by polarization (positive, neutral, negative)]


Figure 5. Distribution of the polarity of sentiments 


Next, the predominant terms are extracted under the TF-IDF approach, with "max_features" equal
to 10, "min_df" equal to 10 and "max_df" equal to 0.7. That is, the 10 most frequent words are extracted,
provided that they appear in at least 10 tweets and in at most 70% of all tweets. Figure 6 shows the results
for the tweets with positive polarity.


print(vectorizer.vocabulary_)

{'class': 1, 'interesting': 2, 'laboratory': 3, 'understand': 7, 'able': 0, 'us': 8, 'teacher': 6, 'use': 9,
'simulink': 5, 'pid': 4}

print(vectorizer.idf_)


[2.28883902 2.54835022 2.13061502 2.54835022 2.79966465 2.70869287
 2.89974811 2.89974811 3.13613689 3.13613689]


Figure 6. Extraction of words with positive polarity 
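
The IDF values printed in Figure 6 can be related to document frequencies. The sketch below assumes the
smoothed formula idf = ln((1 + n) / (1 + df)) + 1, which is the scikit-learn default, and assumes that the
positive subset contains 126 tweets (57.27% of 220); neither assumption is stated in the paper, and the
document frequency obtained is derived here rather than reported.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

def doc_freq_from_idf(n_docs, idf):
    """Invert the formula to recover the document frequency behind an IDF."""
    return (1 + n_docs) / math.exp(idf - 1) - 1

n_positive = 126  # assumption: 57.27% of the 220 tweets

# Document frequency implied by the highest IDF value printed in Figure 6
implied_df = round(doc_freq_from_idf(n_positive, 3.13613689))
print(implied_df)

# Recomputing the IDF from that document frequency
print(round(smoothed_idf(n_positive, implied_df), 4))
```

Under these assumptions, a higher IDF corresponds to a term that occurs in fewer tweets, so the IDF ranks
terms by rarity within the subset rather than by how often they are mentioned.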


As shown in Figure 6, the extracted terms include "simulink" and "pid", and the highest IDF value
in the output is 3.1361. This result allows establishing that the relevant factors that contribute significantly
to student satisfaction during the virtual learning environment are those related to teaching performance in
class sessions that support the understanding of laboratory sessions through simulation tools such as
Simulink and proportional-integral-derivative (PID) controllers. The results are similar to those obtained in
[30], where 76.6% of the tweets are marked as positive; this shows that the students who actively use
Twitter have a good perception and experience of online learning applications. Likewise, the large number
of positive sentiments towards the virtual course is related to the study material used, which is easy to
understand. In [27], a study of perception based on sentiment analysis of the academic activities carried out
by students during the confinement period found a positive polarity of 33.24% and a negative polarity of
17.73%, which is equivalent to a ratio of about 2 to 1. This allowed the conclusion that Systems Engineering
students, despite the adaptation process, have a positive perception of the academic processes and look
optimistically at the possibility of improving the practices developed in virtual learning.

In order to maintain the same level of performance and precision in the word extraction process,
Figure 7 shows the results for the negative polarity tweets. In Figure 7, the extracted terms include
"diagrams" and "block", and the highest IDF value is 2.1574. These extracted terms allow us to establish
that the relevant factors that generated student dissatisfaction during the virtual learning environment were
linked to the laboratory sessions in which graphic representation and block diagrams were used to explain
the class session; evidently, the students were not able to understand the handling of these tools in a
meaningful way. Regarding the identification of satisfaction and dissatisfaction factors related to teaching
performance through the comments or opinions expressed on Twitter, [22] points out that discussion forums
have a great impact on the student's experience, because they are a means for students to clearly express
their thoughts, opinions and struggles regarding learning content; likewise, textual interactions by students
enhance sentiment and cognitive engagement toward the course. As indicated in [24], the predictive model
based on sentiment analysis allowed students of the applied basic statistics subject to be motivated and
satisfied with the use of data science applications during the virtual teaching-learning process.


print(vectorizer.vocabulary_)
{'laboratory': 4, 'graphs': 3, 'able': 0, 'block': 1, 'diagrams': 2}
print(vectorizer.idf_)


[1.84729786 1.9903987  2.15745279 2.15745279 2.15745279]


Figure 7. Extraction of words with negative polarity


The results obtained show that it is possible to obtain a reference point on student satisfaction using
text mining and sentiment analysis techniques. In the same way, as indicated in [11], sentiment analysis can
be performed on different instruments for assessing student satisfaction, allowing an efficient allocation of
resources to programs or faculties within educational institutions. In [23], EmotiBlog was evaluated through
sentiment analysis, focusing on the automatic detection and classification of sentences with subjective
information; its validity was demonstrated, and the application of the developed model was therefore
suggested.


5. CONCLUSION 

This study shares the ideas of various authors on the use of text mining and sentiment analysis in the
educational field to improve teaching-learning conditions, since it allowed the construction of a predictive
model of university student satisfaction with teaching performance during the virtual learning environment,
using the Python programming language in the JupyterLab environment. Regarding polarity, the results
show a predominance of positive sentiments (57.27%) in the comments or opinions expressed on Twitter.
The relevant factors of student satisfaction during the virtual learning environment are related to teaching
performance in the class sessions that support the understanding of the laboratory sessions through
simulation tools such as Simulink and PID controllers. On the other hand, a negative polarity of 15.45%
corresponds to factors linked to the laboratory sessions in which graphic representation and block diagrams
were used to explain the class session.

For future work, it is recommended to expand this line of research through a predictive study and to
apply sentiment analysis and text mining techniques to the satisfaction analysis of the dimensions that make
up university quality, in order to make timely decisions that improve the teaching-learning process during
the semester studied.